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ABSTRACT 

Aberrant microRNA (miRNA) expression is impli- 
cated in tumorigenesis. The underlying mecha- 
nisms are unclear because the regulations of each 
miRNA on potentially hundreds of mRNAs are sam- 
ple specific. We describe a novel approach to in- 
fer Probabilistic M/RNA-mRNA /nteraction Signature 
('ProMISe') from a single pair of miRNA-mRNA 
expression profile. Our model considers mRNA 
and miRNA competition as a probabilistic func- 
tion of the expressed seeds (matches). To demon- 
strate ProMISe, we extensively exploited The Cancer 
Genome Atlas data. As a target predictor, ProMISe 
identifies more confidence/validated targets than 
other methods. Importantly, ProMISe confers higher 
cancer diagnostic power than using expression pro- 
files alone. Gene set enrichment analysis on av- 
eraged ProMISe uniquely revealed respective tar- 
get enrichments of oncomirs miR-21 and 145 in 
glioblastoma and ovarian cancers. Moreover, com- 
paring matched breast (BRCA) and thyroid (THCA) 
tumor/normal samples uncovered thousands of 
tumor-related interactions. For example, ProMISe- 
BRCA network involves miR-155/183/21, which ex- 
hibits higher ProMISe coupled with coherently higher 
miRNA expression and lower target expression; on- 
comirs miR-221 /222 in the ProMISe-THCA network 
engage with many downregulated target genes. To- 
gether, our probabilistic approach of integrating ex- 
pression and sequence scores establishes a func- 
tional link between the aberrant miRNA and mRNA 
expression, which was previously under-appreciated 
due to the methodological differences. 



INTRODUCTION 

MicroRNAs (miRNAs) are small (—22 nucleotides) RNA 
molecules that base-pair with mRNA primarily at the 3' un- 
translated region (UTR) to cause mRNA degradation or 
translational repression (1). Recent studies have linked al- 
terations in miRNA expression with various cancers (2-3). 
Functional characterization of miRNAs depends on precise 
identification of their targets. Earlier developed miRNA 
target prediction programs are mostly based on sequence 
complementarity, evolutionary conservation, free energy 
and/or target site accessibility (4). Although useful, these 
sequence-based methods often suffer from high false pos- 
itive rate and are unable to capture sample-specific inter- 
actions. More recently developed methods have incorpo- 
rated mRNA and miRNA expression data generated by 
microarrays or RNA-seq to predict functional miRNA- 
mRNA interactions (MMIs). Despite diverse modeling ap- 
proaches, a majority of the expression-based methods rely 
on negative expression correlation between miRNA and 
mRNA. In terms of model complexity, these methods range 
from the simplest Pearson correlation to more sophisticated 
Bayesian method. In particular, GenMiR++ is based on 
variational Bayesian to infer the posterior probabilities of 
MMIs as represented by the linear coefficients in a regres- 
sion framework (5). Regularized least-squares linear regres- 
sion such as LASSO has also been used to calculate a sparse 
linear solution of the most significant MMI (6). 

While a step forward from the sequence-based meth- 
ods, there are two important limitations in the current 
expression-based methods. First, these methods usually re- 
quire a large number of samples to compute MMIs. Thus, 
they have difficulty in identifying 'personalized' MMIs in 
individual samples. Indeed, each tissue or cell line has a 
unique miRNA regulatory network with weighted MMI 
edges, which can be used as molecular signatures similar to 
the uniqueness of mRNA/miRNA expression profile (2,7). 
Second, while most methods take into account the potential 
competition among miRNAs for the same mRNA in regres- 
sion models, the reciprocal competition among mRNAs for 
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Figure 1. The proposed competition model schema. (A) Expressed seed 
match of mRNA / for miRNA k is defined as the product of the num- 
ber of target sites q, k and the total expression of mRNA /'. The prob- 
ability of mRNA i 'attracting' miRNA k takes into account the expres- 
sion of miRNA k and the total expression of other mRNA that carries 
compatible seed match for miRNA k. (B) Conversely, the expressed seed 
k for mRNA i is the number of target sites that miRNA k can recognize 
on mRNA i multiplied by the expression of miRNA k. The probability of 
miRNA k targeting mRNA ; considers the 'total' expression of mRNA i 
and all of the other miRNA expression z that can recognize the target sites 
of mRNA i. (C) A toy example. The inputs are 10 simulated mRNAs and 
4 miRNA expression and a 10 x 4 seed-match matrix. ProMISe estimates 
the total mRNA expression based on the mRNA competition. It also in- 
fers the miRNA competition based on the total mRNA and the observed 
miRNA expression. The joint competition is the element-wise product of 
the above two matrices. Colored rectangles highlight some model proper- 
ties explained in the main text. 



the same miRNA has not been systematically addressed. 
Yet both competitions are experimentally supported. For 
the former, not only the endogenous miRNAs may compete 
for the same mRNA harboring overlapping seed matches 
but also for the limited Argonaute (Ago), the catalytic com- 
ponent of the RNA silencing complex (RISC) (8). For the 
latter competition, Arvey et al. (9) showed that miRNAs 
that have a higher number of available target transcripts will 
downregulate each individual target gene to a lesser extent 
than those with a lower number of targets. In other words, 
the affected mRNA target population 'dilutes' the individ- 
ual miRNA effect by sharing target sites among them. 

In this paper, we describe three related models via a 
novel approach inspired by a 'role-switch' analogy. The 
first (and second) model, namely 'mRNA competition' (and 
'miRNA competition'), takes into account the competi- 
tions among mRNAs (and miRNAs) for the same miRNA 
(and mRNA) using paired expression profile coupled with 
target site information (Figure 1). The third model 'joint 
competition' combines the former two predictions as joint 
probabilities. Using the expression data from (10) and The 
Cancer Genome Atlas (TCGA) (11), we first assess the 
proposed models as a target prediction tool by bench- 
marking the confidence or validated targets. The proposed 
models and the resulting probabilistic scores collectively 
termed as the Probabilistic M/RNA mRNA /nteraction 
Signature (ProMISe) confer competitive performance com- 
paring with existing sequence- and regression-based meth- 
ods. Furthermore, ProMISe signature exhibits competitive 
diagnostic power in discriminating normal/tumor profiles 
compared with using expression profiles alone. One ex- 
planation for the above observations is that ProMISe can 
capture complex MMIs not easily identified by examin- 
ing expression profiles alone. For instance, some specific 



MMI changes in tumor are not only due to the expres- 
sion changes of the corresponding miRNA/mRNA mem- 
bers but also due to the expression changes of the compet- 
ing miRNAs and/or mRNAs. Moreover, genes with aber- 
rant ProMISe signature are enriched for cancer-specific and 
oncomir-regulated gene sets based on gene set enrichment 
analysis (GSEA) (12). Some of the oncogenes with aberrant 
interactions do not exhibit significant expression changes. 
Thus, the inferred ProMISe signature can provide comple- 
mentary information to the expression profiles. Integrative 
differential analysis of expression and ProMISe signature 
using matched tumor/normal samples from breast and thy- 
roid cancers revealed many tumor-specific MMIs, involving 
canonical oncogenes and oncomirs. Together, our integra- 
tive approach bridges the aberrant miRNA and gene expres- 
sion profiles with miRNA targeting mechanism, which en- 
ables us to explore cancer biology from a unique molecular 
perspective. 

MATERIALS AND METHODS 

ProMISe 

We propose a novel probabilistic approach to infer prob- 
abilistic miRNA-mRNA interaction signature (ProMISe) 
using a 'single' pair of miRNA-mRNA expression pro- 
file. Assume we have N mRNAs and M miRNA. Let x 
and z be the observed mRNA and miRNA expression vec- 
tors, respectively, and C be the N x M seed-match matrix, 
where c i: k denotes the number of target sites on mRNA i 
for miRNA k. Assuming the cell is at a condition-specific 
equilibrium state, where mRNA are being dynamically tran- 
scribed and degraded (partly due to miRNA binding), then 
the total amount of transcribed mRNA targets x (r) are un- 
observed and higher than the observed (equilibrium) ex- 
pression level x (o) . Our goal is to infer p(t it k\x^\ z ( '\ C) for 
whether miRNA k targets mRNA i, given the (hidden) to- 
tal transcription levels of mRNA and miRNA. Notably, 



,(/) 



M = 



z, assuming that the observed miRNA expres- 
sion is the same as its total transcribed levels. Thus, we drop 
the superscript for z in the following formulation. Here we 
estimate p(t i: i ( \x ( '\ z w , C) by three related models reflect- 
ing mRNA competition, miRNA competition, and joint 
competition, respectively. Let c ^ and c/,. denote the num- 
ber of target sites of each mRNA for miRNA k and the 
number of target sites that mRNA has for each miRNA, 
respectively. In the mRNA competition model, we express 
p(t\ x k \y&, Zk, c.,k) as the probability of mRNA i 'attracting' 
miRNA k, given the expression of all of the mRNAs that 
miRNA k can target according to c. / c (Figure 1A). Thus, 
it reflects the mRNA competition for the same miRNA. 
Specifically, the probability that mRNA i attracting a spe- 
cific miRNA k is calculated as the reversed probability that 
miRNA k is attracted by other mRNA j (j ^ i): 



# c j,k x 



it) 



.(0 



(i) 



Note the use of the exponent z k from Equation (1) to reflect 
that the higher the miRNA k expressed the more likely it 
will be attracted to mRNA i. Conversely, the miRNA com- 
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petition model p{t\'l\x-p , z, c,,.) reflects the probability that 
miRNA k targets mRNA i, taking into account the expres- 
sion of all of the miRNAs that can target mRNA i accord- 
ing to c,,. (Figure IB). The probability that miRNA k target- 
ing a particular mRNA i is calculated as the reversed prob- 
ability that mRNA i is targeted by other miRNA / (/ ^ k): 



,c, ) = 1 



T^k CijZl 



(2) 



Thus, the first and second models respectively reflect the 
'dilution-effects' of multiple mRNAs targeted by the same 
miRNA or the mRNA competition (9) and the competition 
among miRNAs for the same mRNA (13). Finally, we ex- 
press the joint competition as p{tfl\x^, z, C): 

p{t\ j jM'\ z, c) = k&V 0 . **. c,*)^eSi4 0 . z ' c <-)- ( 3 ) 

Notably, an underlying assumption of Equation (3) is that 
mRNA competition and miRNA completion are indepen- 
dent events. Algorithmically, we estimate Equations (1-3) 
together in two phases. As initialization, we set x w = x (o) . 
We first estimate Equation (1). Given Equation (1), the ex- 
pected reduction level due to miRNA k binding is estimated 
as 



„(0 



(4) 



where -n is the 'learning rate' (default: 0.001). Given Ax,;/ f , 
the total transcribed mRNA is updated in two steps as fol- 
lows: 



x W * = x« 



Jf) 



i.k 



X] = 



x {,) * 



(6) 



where Equation (5) reflects the total reduction of mRNA 
i by each miRNA and T defines the transcriptional capac- 
ity of the cell (default: T=l.3J2 xf\ where L3 

is an ar- 
bitrary value that reflects the total transcribed mRNA 'be- 
fore' miRNA repression). We then estimate Equation (2). 
The model alternates between estimating (1) and (2) until 



p(tf^\x w , Zk, c. ,/ f ) and p{tYl\x)'' , z, c,-,.) increase by less than 
a threshold (tol) (default: 10~ 5 ) at f th iteration: 



max 



z k , Ck)' - P(t) 



max 



,c..j0 



(z) i At) 



X- , z, c 



] < tol 



Target site information 

We downloaded human target site information from Tar- 
getScanHuman 6.2 database (14). For each mRNA 
miRNA pair, we calculated the number of corresponding 
conserved target sites. For multiple transcripts of the same 
gene, we used transcripts with the longest 3'UTR. The end 
result is an N x M seed-match matrix of TV distinct mRNAs 
each corresponding to a distinct gene and M distinct miR- 
NAs. We also obtained the context+ scores (CS) (15) and 



probability of conserved targeting (PCT) (14) as sequence- 
based scores for comparison. 



HEK293 test set and power analysis 

Gene and miRNA expression from HEK293 as measured 
by serial analysis of gene expression (SAGE) and small 
RNA-seq were obtained from (10). We constructed the pos- 
itive and negative target sets from the PAR-CLIP and mi- 
croarray data generated by (10) following similar way de- 
scribed in (16). We first downloaded from doRiNAdb (17) 
the confidence AG02 targets identified from PAR-CLIP 
data in the same study. We then downloaded from Gene Ex- 
pression Omnibus (GEO) (GSE21577) the microarray data 
measuring gene expression in HEK293 after treated with 
mock control or a cocktail chemistry inhibiting 27 most 
highly expressed miRNA in HEK293. The true targets of 
the 27 miRNA are expected to exhibit increased expression 
level upon miRNA inhibition. Thus, the confidence posi- 
tive targets were defined as the genes that are AG02 tar- 
gets, have at least one seed match to the 27 miRNA, and 
exhibit fold-change greater than 0.1. To create a confidence 
set of non-targets, we selected the same number of genes 
that are not AG02 targets and exhibit non-positive fold- 
changes with priority given to genes with decreased expres- 
sion. We then assessed the accuracy of the methods using 
receiver operating characteristic (ROC) and precision-recall 
curves (PRCs) (18) (details described in the simulation tests 
in Supplementary Data). 



(5) TCGA data collection and processing 



Expression data were downloaded from TCGA Data Por- 
tal (https://tcga-data.nci.nih.gov). Only the processed data 
(level 3) were used. To date, there are 10 cancer types that are 
associated with data both unrestricted for publication and 
containing paired miRNA and mRNA expression (Table 
1). Except for glioblastoma multiforme (GBM) and ovarian 
serous cystadenocarcinoma (OV), for which the microar- 
ray data were used, RNA-seq(V2) and miRNA-seq were 
used for mRNA and miRNA expression data, respectively. 
Normal/tumor information for each sample were obtained 
from Biospecimen Metadata Browser (https://tcga-data. 
nci.nih.gov/uuid/uuidBrowser.htm) and mapped based on 
the sample ID. For the sequencing data, we used the RPKM 
(read per kilobase of exon per million mapped reads) values 
for mRNA and RPM (reads per million miRNA mapped) 
for miRNA. To ensure individual samples are comparable, 
we further quantile normalized the RPKM and RPM val- 
ues within each disease type. 



Validated miRNA targets, oncogenes, oncomirs, OMIM 
genes and cancer gene hubs 

Validated targets were downloaded from MirTarBase 3.5 
(19), the oncogenes from COSMIC (20), the oncomirs from 
(3) and OMIM (Online Mendelian Inheritance in Man) 
genes from http://www.omim.org/. Putative cancer gene 
hubs, which are discriminative of luminal and basal sub- 
types of 16 breast cancer cell lines, were obtained from (21). 
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Table 1. Paired expression data from TCGA 



Cancer 


n\TA 

mRNA 


miRNA 


Normal 


Tumor 


Total 


BRCA 


13306 


710 


14 


317 


331 


COAD 


13306 


710 


0 


177 


177 


GBM 


10344 


338 


10 


496 


506 


HNSC 


13306 


710 


0 


37 


37 


KIRC 


13306 


710 


0 


274 


274 


LUSC 


13306 


710 


0 


132 


132 


ov 


8371 


542 


8 


565 


573 


READ 


13306 


710 


0 


66 


66 


THCA 


13306 


710 


58 


485 


543 


UCEC 


13306 


710 


1 


124 


125 



BRCA: breast cancer; COAD: colon adenocarcinoma; GBM: glioblastoma multiforme; HNSC: head and neck squamous cell carcinoma; KIRC: clear cell 
carcinoma; LUSC: lung squamous cell carcinoma; OV: ovarian serous cystadenocarcinoma; READ: rectal adenocarcinoma; THCA: thyroid carcinoma; 
UCEC: uterine corpus endometrial carcinoma. 



Methods of comparison 

We compared ProMISe constructed from the three pro- 
posed competition models with six methods, namely Seed 
Matrix containing conserved target sites from TargetScan, 
TargetScan PCT, TargetScan Context Score, Pearson cor- 
relation, LASSO and GenMiR++. For Seed Matrix and 
TargetScan PCT/context score, which do not consider ex- 
pression data, the mRNA-miRNA interactions were sim- 
ply ranked by the corresponding number of target sites and 
specific scores. To be fair, we filtered out interactions involv- 
ing non-expressed miRNA beforehand. Pearson correlation 
was computed using R built-in function cor. Here targets 
with negative correlation were ranked at the top. We imple- 
mented LASSO using glmnet with default parameters (i.e. 
a = 1 for LASSO) except that the best X was chosen using 
cross-validation function cv . glmnet (22). The predictors in 
the LASSO model are the miRNA expression multiplied by 
the seed-match matrix. Thus, the expression of a miRNA 
has no effect on mRNA expression if the corresponding 
seed-matrix entry is zero. Here again targets with negative 
linear coefficients are ranked at the top. To run GenMiR++ 
(5) in Matlab, we converted the above seed matrix to binary 
matrix by setting nonzero target site count to 1 . Due to the 
lack of validated targets, conventional ROC and PRC ap- 
proaches cannot distinguish the method performances on 
TCGA data. Instead, we assessed each method by the num- 
ber of validated targets in their top ranked 500-2000 targets 
with 500-interval (Figure 2). 

Unsupervised learning on cancer data 

Hierarchical clustering was applied to expression profiles or 
ProMISe signature using R function hclust (Figure 3A, 
Supplementary Figure S5). One minus Pearson correlation 
and average linkage were used. 

Supervised learning on cancer data 

To classify samples into normal and cancer, we employed 
regularized logistic regression with Ll/L2-norm using R 
package glmnet (22). Specifically, we trained a linear model 
with a = 0.5 on training set using gene expression, miRNA 
expression, combined expression or ProMISe signature 
from the three competition models. In each training pro- 
cess, the only free parameter A. was determined by 10- 
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Figure 2. Identification of confidence or validated targets. (A) Based on 
positive/negative targets and the prediction scores, ROC and PRC were 
generated for the six methods, where Seed Matrix, TargetScan Context 
Score and PCT are sequence-based and the other three are proposed com- 
petition models. (B) The number of validated targets selected by each 
method among their top 1000 rankings as a function of sample. Ten pan- 
els correspond to 10 cancer types. The curves are smoothed using loess 
from R function geom_smooth. (C) Comparison of the competition mod- 
els with other expression-based methods including Pearson correlation, 
LASSO and GenMiR++. 




Figure 3. Cancer diagnosis. (A) Hierarchical clustering of expression or 
ProMISe signature for thyroid cancer. Red and blue colors indicate tumor 
and normal samples, respectively. (B) Regularized logistic regression was 
applied to classify normal and thyroid cancer tumor profiles using expres- 
sion or ProMISe signature. Mean cross entropy (MCE) from LOOCV was 
used to assess the performance of each method. Superior method confers 
lower MCE and thus higher -loglO(MCE). 
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fold cross-validation using cv.glmnet. The performances 
of the trained models on the training and testing set 
were rigorously evaluated by leave-one-out cross-validation 
(LOOCV) in terms of mean cross entropy (MCE) error: 

1 N 

MCE= -Jl-^SPn (7) 

n=\ 

where t is a binary indicator for tumor (1) or normal (0) and 
p is the predictive probability from the logistic regression 
model for sample n of being tumor. Thus, MCE is small if 
both t n and p n are high or low at the same time and large 
when t n is high and p n is small. We applied LOOCV to 
THCA, which has at least 10% normal samples (Table 1). 
Figure 3B displays the resulting MCE in barplots. 

Permutation tests 

The normal and tumor samples in THCA are not balanced 
(58 normals versus 485 tumors), which may lead to overesti- 
mation of model performance in terms of the MCE (Equa- 
tion (7)). To assess the classifiers' performances more rig- 
orously, we conducted 100 permutation tests by randomly 
swapping values between normal and tumor samples and 
repeating the above LOOCV for each type of biomarkers. 
The empirical />-value of the permutation test is defined as 
the fraction of tests having MCE as least as good as the 
MCE derived from the real test. 

Gene set enrichment analysis 

We applied GSEA (2.0.13) to both gene expression and 
genes with averaged ProMISe signature across miRNAs 
(12). Default setting was used. Gene sets were downloaded 
from the Molecular Signatures Database (MSigDB) (http: 
//www.broadinstitute.org/gsea/msigdb/index.jsp). The max- 
imum and minimum number of genes allowed in each gene 
set was set to 5000 and 15, respectively. 

Paired sample comparison 

Paired ?-test (t.test(. . . , paired = T) in R) was per- 
formed to compare quantile normalized ProMISe or ex- 
pression profiles from the 14 and 58 matched tumor/normal 
samples in BRCA and THCA, respectively. 

Software availability 

ProMISe was implemented as an R package available 
at Bioconductor: www.bioconductor.org/packages/release/ 
bioc/html/Roleswitch.html. 

RESULTS 

ProMISe 

The proposed model takes as inputs the paired mRNA and 
miRNA expression and the seed-match matrix. It then es- 
timates the probabilities of different mRNA 'attracting' the 
same miRNA (i.e. mRNA competition; Figure 1A, Equa- 
tion (1)), the probabilities of different miRNA targeting the 



same mRNA (i.e. miRNA competition; Figure IB, Equa- 
tion (2)), and the element-wise products of the above two 
(i.e. joint competition; Equation (3)). To highlight several 
important features of the proposed model, Figure 1C illus- 
trates a toy example using simulated data of 10 mRNAs 
and 4 miRNAs. First, mRNA i that does not carry a seed 
match for miRNA k has zero probability of being its tar- 
get (e.g. mRNA 1 and miRNA 2) (see Discussion section 

for other possibilities). Second, p(tj k \x^'\ Zk, c^) (mRNA 

competition) and p{tfl\x- , z, c it ) (miRNA competition) 
differ for the same pair of miRNA and mRNA. For in- 
stance, p(t^l\x { '\ Z4, c >4 ) = 0.26 and p{^]\xf, z, c 2 ,.) = 1. 
Intuitively, mRNA 2 has only one target site for miRNA 
4, which can potentially target many other mRNAs (verti- 
cal red boxes) such as mRNA 1 and 3, that each has higher 
expression level, two target sites, and thus higher probabil- 
ities of being targeted by miRNA 4. On the other hand, 
mRN A 2 can only 'attract' miRNA 4 because none of the 
other three miRNAs recognizes its target site (horizontal 
blue boxes), which explains p(q 4 \x^ , z, c 2 ,.) = 1. Third, the 
joint competition model (Equation (3)) provides a conser- 
vative estimate, for which both miRNA and mRNA com- 
petition scores must be high to confer a high confidence 
prediction (e.g. mRNA 10 and miRNA 2). The model con- 
verges quickly in only a few iterations for large number of 
mRNAs and miRNAs (Supplementary Figure SI). To rig- 
orously test the models, we designed four scenarios (Sup- 
plementary Figure S2). Below we compared ProMISe with 
other methods using real expression data. 

Target predictions 

We first compared ProMISe with sequence-based methods 
in discriminating the 1255 confidence positive and negative 
targets identified from HEK293 using the published data by 
( 10) (Materials and Methods section). We chose TargetScan 
Context Score (15) and Probabilities of Conserved Target- 
ing (PCT) (14) as the representative sequence-based meth- 
ods based on our recent evaluations (4). We observed excel- 
lent Area Under the ROC Curve (AUROC) from all three 
proposed competition models (Figure 2A, left). In partic- 
ular, the joint and mRNA competition models confer the 
highest AUROC (99.7%), which is 1-6% higher than the 
sequence-based methods. Additionally, we observed even 
better performance from the competition models in terms 
of precision recall (Figure 2A, right) with 19.4-45.9% im- 
provements over the sequence -based methods. Interestingly, 
the mRNA competition model outperforms the joint and 
miRNA competition models by 0.4% and 7%, respectively. 
We next evaluated each model on the TCGA data. To fur- 
ther take advantage of the sequence information, we mul- 
tiplied ProMISe with the TargetScan PCT. To be fair with 
the sequence-based methods, we filtered out interactions in- 
volving miRNAs that are not expressed in each sample be- 
forehand. Each panel in Figure 2B depicts the number of 
validated targets among the top 1000 ranking as a function 
of sample ID. The ProMISe constructed from the mRNA 
competition model demonstrated clearly superior perfor- 
mances over other methods in identifying validated MMIs 
(Figure 2B). The joint competition model achieves similar 
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performances as the mRNA competition model. Moreover, 
we observed consistent results using the top rankings from 
500 to 2000 with 500-increment (Supplementary Figure S3). 
We then compared ProMISe with three other expression- 
based methods, namely Pearson correlation, LASSO re- 
gression (6) and GenMiR++ (5). Because ProMISe con- 
fers good pairwise correlation between individual samples 
(Supplementary Figure S4), the averaged ProMISe signa- 
ture over all samples was used for each cancer type. All three 
competition models demonstrated superior performances 
in 9 out of the 10 cancer types while the mRNA or joint 
competition models achieve higher performance than the 
miRNA competition model (Figure 2C). Thus, ProMISe 
compares favorably with the existing methods, and yet has 
the unique advantage of constructing ProMISe signature 
from each individual sample. 

Cancer diagnosis 

We examined whether using the ProMISe signature can dis- 
criminate tumors from normal samples in comparison with 
using expression profiles. We first performed hierarchical 
clustering as an unsupervised approach. Indeed, for each 
cancer type ProMISe from all three competition models 
cluster based on normal (blue) and tumor (red) (Supple- 
mentary Figure S5). For instance, clustering of thyroid can- 
cer data, which consist of 58 normals and 485 tumors, is 
more consistent with the underlying phenotypes than clus- 
tering by expression (Figure 3B). We then performed regu- 
larized logistic regression on THCA and evaluated the test- 
ing accuracy by MCE error from LOOCV. As shown in Fig- 
ure 3B, the mRNA and joint competition models achieve 
the lowest MCE, which is significantly better than random- 
ized data based on permutation test {p < 0.01). Notably, 
however, the performances of each model on individual 
test cases tend to vary due to the heterogeneity of the tu- 
mor samples (Supplementary Figure S6). The effects of re- 
sampling on biomarker robustness and rigorous counter- 
strategies have been demonstrated by Li et al. (23). As a 
simple remedy, we focused our comparison on the matched 
tumor/normal samples to mitigate such effects in deriv- 
ing tumor-specific MMIs in sections below. Together, the 
additional leverage provided by ProMISe over the (com- 
bined) expression data suggests that our integrative ap- 
proach of modeling MMI provides useful diagnostic infor- 
mation complementary to the expression profile signatures. 
In the following analyses, we used the mRNA competition 
model to represent ProMISe. 

GSEA on ProMISe signature uniquely revealed abnormal ac- 
tivities of miR-21 and miR-145 in GBM and OV 

Aberrant miRNA and mRNA expressions have long been 
implicated in some specific cancer phenotypes (3). To ex- 
amine whether miRNA-dysregulated genes (MDGs) are en- 
riched for meaningful gene sets, we averaged ProMISe sig- 
nature of each gene over all of the miRNAs. Surprisingly, 
we not only recovered cancer-specific but also oncomir- 
specific target gene sets with false discovery rate (FDR) 
<0.05 (Supplementary Table SI). In particular, synaptic 
transmission (MSigDB ID: M19659), psychiatric disor- 



A. miR-21 exhibits higher activity in glioblastoma 




B. miR-145 exhibits lower activity in ovarian cancer 




Figure 4. Gene set enrichment analysis. GSEA was applied to the aver- 
aged ProMISe signature of each gene over all of the miRNAs in tumor ver- 
sus normal comparison (12). Enriched gene sets corresponding to miR-21 
and miR-145 targets were identified from the GBM and OV data, respec- 
tively. The enriched gene sets for GBM (OV) has its gene members accu- 
mulated at the top of the list ranked by ProMISe signature changes in tu- 
mor versus normal (normal versus tumor). The right panels illustrate the 
network view of specific miRNA targets. The filled color in the node corre- 
sponds to expression log2 fold-change in tumor, where red (blue) indicates 
increased (decreased) gene/miRNA expression. The edge color and width 
captures the ProMISe signature changes as Signal/Noise ratio as calcu- 
lated by GSEA, where red (blue) indicates increased (decreased) ProMISe 
signature in tumor. The network was generated using Cytoscape (26). 



ders (M2110), medulloblastoma (M16478), neuron cell- 
type specificity (Ml 712) and human brain aging (M9112) 
are among the top 10 ranked gene sets with lower ProMISe 
in GBM. On the other hand, gene set corresponding to 
the predicted miR-21 targets (Ml 9659) ranks the second 
among those exhibiting higher ProMISe in GBM (Figure 
4A, left). Previous studies have shown that the increased 
expression of miR-21 causes downregulation of tumor sup- 
pressor PDCD4, which stimulates cell growth across GBM 
cell lines (3,24). Indeed, we observed an increased ProMISe 
and a coherent decreased PDCD4 expression in tumor (Fig- 
ure 4A, right). Additionally, miR-21 also regulates RECK 
and TIMP3 to promote anti-apoptosis and migration of 
GBM cells (24). Notably, both ProMISe and expression for 
genes RECK and TIMP3 have increased in GBM tumors, 
which is in fact consistent with our intuition that the higher 
the miRNA/mRNA abundance the more likely they inter- 
act with each other. We applied the same analysis to ovar- 
ian cancer and discovered an over-representation of miR- 
145 target set (M15956), ranking third among the gene sets 
with lower ProMISe in OV. Indeed, miR-145 is downregu- 
lated in OV and its targets confirmed by luciferase reporter 
assays include CBFB, PPP3CA and CLINT1 (25). No- 
tably, CLINT1 has an increased expression and decreased 
ProMISe (Figure 4B)(26). 
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A. BRCA B. THCA 
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Figure 5. Paired sample test on matched tumor and normal in BRCA and 
THCA. (A and B) Volcano plots illustrate the significant interactions with 
—log 10(FDR) as a function of the averaged ProMISe log2 fold-change 
in BRCA and THCA tumors, respectively. (C and D) Three-way Venn dia- 
gram showing the overlaps among genes with differential expression (blue), 
genes with differential interactions (red), and putative cancer gene hubs 
(green) (21), respectively. Hypergeometric p- values are displayed near each 
pairwise overlap, and the common genes are displayed near the overlap. 



ProMISe-predicted MDGs in tumors are enriched for canon- 
ical cancer genes and cancer genes hubs 

Using the 14 matched tumor/normal samples from TCGA- 
BRCA data, we performed paired f-test to obtain differen- 
tial ProMISe signature and mRNA/miRNA expression in 
tumor versus normal. In total, 23 797 out of 56 805 MMIs 
exhibit significant changes in ProMISe at FDR < 0.05 (Fig- 
ure 5A), which involve 5690 out of 13035 and 103 out of 710 
distinct MDGs and miRNAs, respectively. At the same cut- 
off, we obtained 3073 and 40 differentially expressed genes 
(DEGs) and miRNA. We then assessed the overlap between 
the DEGs and MDGs and their overlaps with the puta- 
tive cancer gene hubs from (21). As illustrated in the Venn 
diagram (Figure 5C), the DEGs are significantly enriched 
for MDGs (p < le-447, Hypergeometric test). However, 
3285 MDGs are not DEGs, and 681 DEGs are not MDGs. 
Moreover, MDGs are significantly enriched for cancer gene 
hubs: 18 out of the 21 genes have abnormal ProMISe sig- 
nature in tumor (p <10~ 5 , Hypergeometric test). In con- 
trast, only 5 cancer gene hubs exhibit aberrant expression 
(p < 1). We repeated the same analysis on 58 matched 
tumor/normal samples for thyroid cancer (THCA) to ob- 
tain 35 453 out of 86 923 significant interactions at FDR < 
0.005 (Figure 5B), involving 6780 (113) MDGs (miRNA) 
and 4778 (58) DEGs (miRNA). Twenty of the cancer gene 
hubs exhibit abnormal interaction with miRNAs (p < 10~ 6 ) 
but only 13 of them have differential expression (Figure 
5D). Moreover, we compared DEGs and MDGs in both 
cancers with cancer-related genes from OMIM and ob- 
tained consistent results (Supplementary Figure S7). Thus, 
genes with differential MMIs may not have aberrant ex- 
pression in tumor and vice versa. It is possible that some 
of the tumor-related genes are more strongly regulated at 




Figure 6. Heatmaps using differential MMIs and expression profiles for 
BRCA and THCA. In each panel, three sets of heatmaps corresponding to 
ProMISe (from the mRNA competition model), miRNA and gene expres- 
sion were generated for differential MMIs involving putative cancer gene 
hubs from (21). Blue and red column-wise color codes for normal and tu- 
mor samples, respectively. Please refer to the main text for more details. 

the translational levels. In that case, our approach is still 
able to detect at least some of these translationally regu- 
lated genes since ProMISe exploits not only the gene expres- 
sion but also the miRNA expression as well as the sequence 
information. Moreover, the variability of gene expression 
profiles between individual tumors can be extremely high, 
which leads to low detection power of the true cancer sig- 
nals using gene expression alone (23). Together, the results 
suggest that we have gained statistical power by identify- 
ing more cancer-related genes using ProMISe than using 
expression alone. 

Integrative differential analysis on breast and thyroid cancer 
uncovered large cancer-specific MMI network modules 

To visualize tumor-related changes, we generated two sets of 
heatmaps for BRCA and THCA using the above differen- 
tial ProMISe signature and expression profiles involving the 
putative cancer gene hubs (21) (Figure 6) and OMIM genes 
(Supplementary Figure S7). Specifically, we performed hier- 
archical clustering on the selected ProMISe signature pro- 
files. As expected, normals (blue) and tumors (red) form two 
distinct clusters except for only a few misclassified samples 
in THCA. Notably, the cancer gene hubs were derived from 
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Figure 7. Network view of the 1257 filtered MMIs for breast cancer. 
The filled color in the node corresponds to /-statistics from paired /-test 
comparing tumor with normal, where red (blue) indicates higher (lower) 
gene/miRNA expression in tumor. The edge color and width captures the 
ProMISe signature changes in /-statistics, where red (blue) indicates in- 
creased (decreased) ProMISe signature in tumor. The blown- up network 
modules illustrate interaction changes of the BRCA-related oncomirs hsa- 
miR-155/21/183/145. 



breast cancer cell lines (21), which may explain the better 
clustering of BRCA samples. Interestingly, we observed a 
slightly more consistent clustering of THCA samples when 
using THCA-related genes from OMIM (Supplementary 
Figure S8B). The overall pattern suggests a positive correla- 
tion between the miRNA expression and ProMISe, whereas 
the correlation between gene expression and the ProMISe 
signature is less straightforward. On the one hand, genes 
with higher (lower) expression have higher (lower) like- 
lihood of being targeted. On the other hand, decreased 
(increased) MMIs may ultimately imply an increased (de- 
creased) gene expression. Accordingly, we devised the fol- 
lowing rule to retain only a subset of the tumor-specific 
interaction changes coherent with expression changes: (i) 
the tumor-specific differences in terms of ProMISe signa- 
ture, gene and miRNA expression must be significant; (ii) 
the change sign between ProMISe signature and miRNA 
must be the same; (iii) the change sign between ProMISe 
signature/miRNA and gene expression must be the oppo- 
site. Some of the qualified pairs are highlighted in purple 
boxes in Supplementary Figure S8. In total, we obtained 
1257 and 3255 qualified interactions for BRCA and THCA, 
corresponding to 748 (32) and 1679 (44) distinct genes 
(miRNAs), among which 42 (9) and 77 (15) are oncogenes 
(oncomirs), respectively (Supplementary Table S2). The re- 
stricted MMI forms two large network modules, harbor- 
ing genes with significant up- (red) and downregulation 
(blue) status associated with decreased (blue edge) and in- 
creased (red edge) ProMISe signature, respectively (Figures 
7 and 8). For instance, the increased expression of hsa-miR- 
155 /2 1/183 is coupled with increased interactions with its 
target genes, which exhibit decreased expression in breast 
tumor samples (Figure 7). Previous studies have shown the 
expression increase of the three miRNAs in breast cancer 
tissues or cell lines; however, their targetomes specific to 




Figure 8. Network view of the 3255 filtered MMIs for thyroid cancer. The 
blown-up images display subnetworks for hsa-miR-221/222/145. Other- 
wise please refer to Figure 7. 



breast cancer are not well characterized (3). On the other 
hand, BRCA-related oncomirs including let-7c and miR- 
145 are underexpressed in BRCA (Figure 7). Both let- 7c 
and hsa-miR-145 are tumor suppressor genes and modu- 
late motility, inhibit cell growth, and induce apoptosis of 
breast cancer cell lines (3,27). For thyroid cancer, our re- 
sults remarkably show that hsa-miR-22 1/222 are overex- 
pressed, consistent with the previous finding (3), and sig- 
nificantly downregulate a large cohort of genes including 
oncogenes ARID 1 A, ARNT, BCL11B, DICER 1, ELK4, 
TCF12, TCF7L2, ARID 1 A, ARNT, BCL11B, DICER 1, 
ELK4, TCF12 and TCF7L2 (Figure 8). Notably, miR-145 
is also downregulated in THCA coupled with upregulation 
of oncogenes BCR, EML4, MAP2K4 and TPM4. Thus, the 
reduced activities of miR-145 are prevalent in breast, ovar- 
ian and thyroid cancers based on our results, perhaps under- 
lining its important role in maintaining normal cell state. 
Together, our novel integrative approach enables discov- 
ering cancer-specific miRNAs and associating them with 
oncogenes at system level. 

DISCUSSION 

The existence of cell-type-specific expression 'signatures' 
of miRN A and gene has been previously appreciated (28). 
However, to our knowledge, there is no systematic method 
that identifies the MMI signature from each individual sam- 
ple. In particular, most existing methods are based on neg- 
ative correlation (5,29,30,31,32). These methods can only 
identify MMIs by aggregating multiple samples. In a real- 
world application such as cancers, however, the correlation- 
based methods are limited to identifying only the robust 
interactions due to the heterogeneity of individual expres- 
sion profiles. Additionally, most existing methods are in 
a regression framework, by treating mRNAs as response 
and miRNAs as input variables, which assumes that mR- 
NAs are independent of one another. However, mRNAs 
targeted by a common miRNA 'do' interact by competing 
for that miRNA (9). In this study, we describe a novel ap- 
proach and a new class of molecular signature collectively 
termed as ProMISe. In essence, ProMISe signature is an N 
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x M probability matrix for interactions between mRNA i 
and miRNA k. Notably, we do not normalize the proba- 
bilities so that each row or column sums to 1 because a 
miRNA can target multiple mRNAs with high probabil- 
ities and vice versa. Also, miRNA members in the same 
family (i.e. miRNAs with the same seed region) may be 
expressed differently. Thus, we keep them separate rather 
than combining them to reflect the competition within as 
well as between miRNA families. We explored three model 
formalisms, namely 'mRNA competition', 'miRNA com- 
petition' and 'joint competition'. Although all three mod- 
els compare favorably with the existing methods, our results 
more strongly support the mRNA competition model. Pre- 
sumably, the mRNA competition has more prevalent effects 
on MMI because there are more potential mRNA targets 
per miRNA than there are miRNA regulators per mRNA. 

The success of our approach is based on the very fact of 
the miRNA targeting mechanism: each miRNA molecule 
can only bind to and induces degradation of one mRNA 
molecule at a time. This concept fits nicely with the digital 
read count offered by the next-generation sequencing tech- 
nologies. In contrast, the same technique is not readily ap- 
plicable to model transcript factor (TF)-gene interactions 
since the enzymatic activities of each TF can differ widely 
regardless of its physical abundance at mRNA or protein 
level. On the other hand, we also made several simplify- 
ing assumptions. First, we assume that mRNA without any 
seed match to a particular miRNA will have zero probabil- 
ities of being targeted by that miRNA. Recent literatures 
suggest a 'seedless' model, where miRNA can bind to re- 
gions other than 3'UTR with imperfect matches (33). In- 
clusion of seedless matches, however, will introduce a large 
number of false positives and thus requires a more specific 
model. Second, we assume the binding of miRNA will al- 
ways induce the mRNA degradation, which is supported by 
a recent study via ribosome profiling after miRNA pertur- 
bations (34). However, some miRNAs may modulate gene 
expression primarily at the translational level (35,36). In 
that case, only the seed-match matrix and miRNA expres- 
sion are informative to inferring MMIs. By comparing ob- 
served mRNA levels with the proteomic data (if available), 
we should be able to more realistically capture the mode of 
translational repression. Third, we assume the binding effi- 
cacy is the same between any miRNA and any mRNA. It is 
more realistic to use different mRNA-miRNA binding effi- 
cacy, which can be estimated from miRNA over- or under- 
expression data (4). Fourth, we do not know the grand total 
amount of transcripts within the cellular capacity under a 
specific condition. In our model, we arbitrarily assume that 
the total Tis 30% (by default) more than the observed total. 
Fixing the total amount prevents ever-increasing individual 
total RNA levels estimated in each iteration but introduces 
a free parameter into the model. Since the miRNA target- 
ing primarily occurs in the cytoplasm, it may be possible 
to estimate the total amount T by simultaneously measur- 
ing the RNA in both nucleus and cytoplasm using recently 
developed RiboMinus RNA-seq technology (37). Fifth, we 
only consider mRNAs as competitors for miRNAs in our 
model. Several recent studies have focused on the interplays 
between miRNA and long noncoding RNA (IncRNA) in- 
cluding pseudogene (38) as well as circular RNA (circRNA) 



(39), which are collectively termed as the competing endoge- 
nous RNA (ceRNA). Thus, it would be more realistic to 
consider the seed match and abundance of ceRNA when in- 
ferring MMI. Finally, there are many other factors that we 
have not considered, which nonetheless influence the out- 
come and our interpretation of MMI. For instance, our 
model only considers the total expressed seed matches at 
3'UTR of mRNA for the same miRNA in the mRNA com- 
petition model and total expressed seeds of miRNA for the 
same mRNA in the miRNA competition model. It would 
be more informative to examine single nucleotide polymor- 
phism in tumor samples, which may disrupt (introduce) ex- 
isting (new) seed regions on miRNAs and/or seed matches 
on mRNAs. Moreover, aberrant expressions may merely be 
the consequence of copy number variation (CNV) at the 
genomic DNA level (21), which are not directly related to 
miRNA dysregulation. In tumors, some mRNAs may be 
regulated transiently by TF and/or miRNA, which may 
only be captured by single-cell RNA-seq (40). With more 
data becoming available from TCGA and elsewhere, we will 
examine these possibilities in future work. 

Comparing with other methods, the most distinct fea- 
ture of ProMISe is its ability to operate on a single pair of 
miRNA and mRNA expression profiles measured from the 
same individual. This unique ability allows us to construct 
ProMISe signature from each individual tumor or normal 
sample, perform cancer diagnosis using this novel molecular 
signature (Figure 3), and ultimately predict tumor-specific 
MMIs (Figures 7 and 8). For the latter, we focused our 
analysis on only the respective 14 and 58 matched tumor 
and normal samples in BRCA and THCA to mitigate het- 
erogeneity among cancer samples. However, a significantly 
higher (lower) likelihood of MMI in tumor (w.r.t. normal) 
may merely reflect the significantly higher (lower) expres- 
sion of miRNA and/or gene, which may in turn be the con- 
sequence of CNV (21). Although we show that genes in- 
volved in aberrant MMIs are enriched for meaningful onco- 
genic signature genes (Figure 5), it is important to distin- 
guish two scenarios: (i) aberrant gene expression is caused 
by miRNA dysregulation; (ii) abnormal MMI reflects aber- 
rant miRNA/gene expression. We designed a simple way 
to minimize the confounding effects caused by the second 
scenario. In particular, we focused only on the significant 
interactions with the same change sign as miRNA expres- 
sion change but opposite change sign to the gene expres- 
sion changes. Surprisingly, we were still able to obtain as 
many as 1257 and 3255 qualified interactions for BRCA 
and THCA, respectively (Supplementary Table S2). Visu- 
alizing these interactions in the network context revealed 
several meaningful network modules involving previously 
discovered and potential oncomirs and oncogenes (Figures 
7 and8). From clinical perspective, the decreasing cost of 
genome-wide large and small RNA profiling will facili- 
tate designing personalized medicine in small RNA ther- 
apy. Ideally, the unique ProMISe signature inferred from 
the paired expression profiles from an unknown sample will 
be compared with the categorized ProMISe signature estab- 
lished from existing data to identify disease-specific miRN A 
regulatory network modules. The knowledge link between 
miRNA and mRNA will enable designing drugs that con- 
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trol both the aberrant miRNA and gene expression syner- 
gistically. 
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